Data visualisation 2

Visualising summary statistics

Daniela Palleschi

Humboldt-Universität zu Berlin

2023-06-19

Review

Last week we…

  • learned about (new) measures of central tendency ✅
  • learned about measures of dispersion ✅
  • learned how to use the summarise() function from dplyr ✅
  • learned how to produce summaries by group (.by =) ✅

Today's goals

This week we will learn how to…

  • use facet_wrap() to plot more than three variables
  • visualise summary statistics
  • create multi-part plots

Want more?

Setup

pacman::p_load(tidyverse,
               here,
               palmerpenguins,
               ggthemes,
               patchwork)
df_penguins <- palmerpenguins::penguins %>% 
  drop_na()

Review: Visualising distributions

  • we did this in week 3 using
    • histograms (1 numerical variable)
    • density plots (1 numerical variable)
    • scatterplots (2 numerical variables)
    • barplots (categorical variables)
  • how many variables are represented in each figure in Figure 1?
    • what types of variables are represented in each plot type?

Figure 1: Different plot types to visualise the distribution of raw data: histogram (A), density plot (B), scatterplot (C), stacked barplot (D), and dodged barplot (E)

Violin plots

  • we can also use violin plots, which are pretty trendy at the moment
    • basically a double-sided/mirrored density plot
  • violin plots are considered easier to read
    • as we’ll see later, they’re easy to layer with other plots too
df_penguins %>% 
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_violin(alpha = .2) +
  labs(title = "Violin plot",
       x = "Species",
       y = "Body mass (g)",
       fill = "Species") +
  scale_color_colorblind() +
  scale_fill_colorblind() +
  theme_minimal()
Figure 2: Violin plot: a mirrored density plot

‘Mirrored’ density plot

What does a ‘mirrored’ density plot mean? A violin plot is literally a double-sided density plot. Compare Figure 2 to Figure 3: they show the same data and the same distribution, but the violin plot draws the density curve on both sides of a central axis and omits the density values along the axis.

df_penguins %>% 
  ggplot(aes(y = body_mass_g, fill = species)) +
  facet_grid(~species) +
  geom_density(alpha = .2) +
  labs(title = "Density plot") +
  scale_color_colorblind() +
  scale_fill_colorblind() +
  theme_minimal()
Figure 3: Density plot: the same data as the violin plot

df_penguins %>% 
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_violin(alpha = .2) +
  labs(title = "Violin plot",
       x = "Species",
       y = "Body mass (g)",
       fill = "Species") +
  scale_color_colorblind() +
  scale_fill_colorblind() +
  theme_minimal()
Figure 4: Violin plot: a mirrored density plot

Visualising 3 or more variables

  • as we know, we can incorporate more variables by mapping them onto aesthetics (e.g., colour, fill, or shape)
  • Figure 1 did this by using colour (all plots) and shape (scatterplot) to visualise species or sex in addition to what was mapped along the x- and y-axes
  • adding too many variables into a single plot can make it difficult to read
  • for example, how many variables are mapped in the following code?
df_penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island)) +
  labs(
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Island"
  ) +
  scale_color_colorblind() +
  theme_minimal()
Figure 5: A cluttered scatterplot with 4 variables

  • four: flipper_length_mm (x-axis), body_mass_g (y-axis), species (color), island (shape)
  • this is a bit visually cluttered!

facet_wrap()

  • a nice way to split our data into different plots is by using facet_wrap()
    • can be used to split one cluttered plot into separate panels based on a categorical variable
  • let’s try using facet_wrap() to divide Figure 5 into three panels, by island
df_penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  facet_wrap(~island) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  ) +
  scale_color_colorblind() +
  theme_bw()
Figure 6: The scatterplot split into one panel per island with facet_wrap(~island)

  • what type of variables can facet_wrap() take as its argument(s)?
    • categorical! Each ‘category’ gets its own panel
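To check that each category really gets its own panel, we can inspect the layout that ggplot2 computes. This is a minimal sketch: the object names p_wrap and panel_layout are made up for illustration, and it assumes that ggplot_build() exposes the panel layout as a data frame with one row per panel.

```r
library(tidyverse)
library(palmerpenguins)

df_penguins <- penguins %>%
  drop_na()

# same facetted scatterplot as above, without the styling
p_wrap <- df_penguins %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  facet_wrap(~island) +
  geom_point()

# the computed layout has one row per panel
panel_layout <- ggplot_build(p_wrap)$layout$layout

n_distinct(df_penguins$island) # number of categories in island
nrow(panel_layout)             # number of panels
```

Both numbers should match: one panel per island.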

facet_grid()

facet_wrap() is related to facet_grid(), which can take two categorical variables, one in rows and one in columns. The argument for facet_grid() is a formula: rows~columns. So, if we add facet_grid(sex~island) to our plot, we should see the data in panels arranged by sex in rows (one row for female, one row for male) and by island in columns (one column for each island).

df_penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  facet_grid(sex~island) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  ) +
  scale_color_colorblind() +
  theme_bw()
Figure 7: facet_grid(sex~island)
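The rows~columns logic can also be read off the computed panel layout. A minimal sketch, with hypothetical object names (p_grid, grid_layout) and assuming the layout data frame contains ROW and COL columns next to the facetting variables:

```r
library(tidyverse)
library(palmerpenguins)

df_penguins <- penguins %>%
  drop_na()

# sex defines the rows, island defines the columns
p_grid <- df_penguins %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  facet_grid(sex ~ island) +
  geom_point()

grid_layout <- ggplot_build(p_grid)$layout$layout

# which sex/island combination ends up in which row and column
grid_layout[, c("ROW", "COL", "sex", "island")]
```

Swapping the formula to island ~ sex would swap rows and columns.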

Representing summary statistics

  • last week we talked about summary statistics
    • measures of central tendency and measures of dispersion
  • now we will learn how to visualise some of these statistics
    • and learn some new ones

Boxplot

  • boxplots (sometimes called box-and-whisker plots) contain:
    • a box with a line in the middle
    • lines sticking out of the top and bottom of the box (the whiskers)
  • which represent:
    • thick line: the median, also called Q2 (2nd quartile; the middle value, above and below which 50% of the data lie)
    • box: the interquartile range (IQR; the range within which the middle 50% of the data lie), with the boundaries:
      • Q1 (1st quartile, below which 25% of the data lie)
      • Q3 (3rd quartile, above which 25% of the data lie)
    • whiskers: extend from Q1 (lower whisker) and Q3 (upper whisker) to the most extreme data point within 1.5*IQR of the box
    • dots: outliers (values more than 1.5*IQR below Q1 or above Q3, i.e., beyond the whiskers)
Figure 8: Boxplot of df_penguins (body mass by sex)

Image source: Winter (2019) (all rights reserved)

Or, explained another way:

Image source: Wickham et al. (n.d.) (all rights reserved)
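The parts of a boxplot can also be computed by hand, which helps to see what the box, whiskers, and dots stand for. This is a sketch under the assumption that quartiles are computed with quantile() and its default settings; the column names (q1, q2, q3, lower_fence, upper_fence, n_outliers) are made up for illustration.

```r
library(tidyverse)
library(palmerpenguins)

df_penguins <- penguins %>%
  drop_na()

box_stats <- df_penguins %>%
  summarise(
    q1 = quantile(body_mass_g, 0.25, names = FALSE), # lower edge of the box
    q2 = median(body_mass_g),                        # thick line
    q3 = quantile(body_mass_g, 0.75, names = FALSE), # upper edge of the box
    iqr = q3 - q1,                                   # height of the box
    lower_fence = q1 - 1.5 * iqr, # whiskers cannot reach below this value
    upper_fence = q3 + 1.5 * iqr, # whiskers cannot reach above this value
    # values beyond the fences would be drawn as dots
    n_outliers = sum(body_mass_g < lower_fence | body_mass_g > upper_fence),
    .by = sex
  )

box_stats
```

The whiskers themselves end at the most extreme observed value inside the fences, so they are usually shorter than 1.5*IQR.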

geom_boxplot()

  • we can produce boxplots with geom_boxplot()
df_penguins %>% 
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  theme_bw()
Figure 9: geom_boxplot()

  • how many variables, and of what types, does geom_boxplot() take?
    • 2 variables: 1 continuous, 1 categorical

Grouped boxplot

  • as with barplots, we can produce grouped boxplots to visualise more variables
    • just map a new variable to the colour or fill aesthetic
df_penguins %>% 
  ggplot(aes(x = species, y = body_mass_g, colour = sex)) +
  geom_boxplot() +
  labs(
    x = "Species",
    y = "Body mass (g)",
    colour = "Sex"
  ) +
  scale_colour_colorblind() +
  theme_bw()
A grouped boxplot

Visualising the mean

  • boxplots show a measure of central tendency, and several measures of dispersion
    • median, IQR (Q1 and Q3), 1.5*IQR (whiskers), and outliers (dots)
  • but we typically want to describe the mean and standard deviation when drawing conclusions about differences between groups
  • how might we do this?

Errorbar plots

  • we can visualise the mean and standard deviation with errorbar plots
    • sometimes called interaction plots
  • these plots have 2 parts:
    • the mean, visualised with geom_point()
    • the sd, visualised with geom_errorbar()
  • the errorbars span the range from 1 standard deviation below to 1 standard deviation above the mean (mean +/- 1SD)
Figure 10: Errorbar plot of df_penguins (body mass by sex)

Computing summary statistics

  • we need to first calculate the mean and standard deviation, grouped by whatever variables we want to visualise
    • let’s stick with body_mass_g by species and sex
    • how can we compute the mean and sd of body_mass_g by species and sex?
df_penguins %>% 
  summarise(mean = mean(body_mass_g),
            sd = sd(body_mass_g),
            N = n(),
            .by = c(species,sex)) %>% 
  arrange(species, sex) %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling(font_size = 30)
species    sex        mean        sd    N
Adelie     female  3368.836  269.3801  73
Adelie     male    4043.493  346.8116  73
Chinstrap  female  3527.206  285.3339  34
Chinstrap  male    3938.971  362.1376  34
Gentoo     female  4679.741  281.5783  58
Gentoo     male    5484.836  313.1586  61
  • we have to feed this summary into ggplot2
    • without the table formatting from knitr and kableExtra!
    • we can do this by saving the summary as a new object, or by piping the summary directly into ggplot()
# Create new object with summaries
sum_penguins <- df_penguins %>% 
  summarise(mean = mean(body_mass_g),
            sd = sd(body_mass_g),
            upper = mean+sd,
            lower = mean-sd,
            N = n(),
            .by = c(species,sex)) %>% 
  arrange(species, sex)
# Feed new object into ggplot
sum_penguins %>% 
  ggplot(aes(x = sex, y = mean, colour = species)) 

# Alternatively, pipe the summary directly into ggplot
df_penguins %>% 
  summarise(mean = mean(body_mass_g),
            sd = sd(body_mass_g),
            upper = mean+sd,
            lower = mean-sd,
            N = n(),
            .by = c(species,sex)) %>% 
  arrange(species, sex) %>% 
  ggplot(aes(x = sex, y = mean, colour = species)) 

Plotting the mean

  • we do this with geom_point()
sum_penguins %>% 
  ggplot(aes(x = sex, y = mean, 
             colour = species, shape = species)) +
  geom_point()

Adding errorbars

  • we do this with geom_errorbar()
sum_penguins %>% 
  ggplot(aes(x = species, y = mean, 
             colour = sex, shape = sex)) +
  geom_point() +
  geom_errorbar(aes(ymin=lower,ymax=upper)) 

  • we need to add the mapping aesthetics for the upper and lower limits of the errorbar
    • aes(ymin = mean-sd, ymax = mean+sd)
    • we used summarise to compute mean-sd (lower) and mean+sd (upper) for each group for us, so we can use those instead
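For comparison, here is a minimal sketch of the first option, where the errorbar limits are computed inside aes() rather than in summarise(). The object names sum_mean_sd and p_errorbar are hypothetical, and the summary here deliberately contains no lower and upper columns.

```r
library(tidyverse)
library(palmerpenguins)

df_penguins <- penguins %>%
  drop_na()

# summary with only mean and sd (no precomputed limits)
sum_mean_sd <- df_penguins %>%
  summarise(mean = mean(body_mass_g),
            sd = sd(body_mass_g),
            .by = c(species, sex))

p_errorbar <- sum_mean_sd %>%
  ggplot(aes(x = species, y = mean,
             colour = sex, shape = sex)) +
  geom_point() +
  # limits are computed on the fly from the mean and sd columns
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd))

p_errorbar
```

Both versions should draw the same errorbars; precomputing lower and upper just keeps the plotting code shorter.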

Barplot of mean: stay away!

I implore you, do not plot means using barplots! You will very often see barplots of mean values, and others might even teach this in other courses, but there are lots of reasons why this is a bad idea!

Firstly, they can be very misleading. They start at 0 and give the impression that data stop at the mean, when about half the data is (usually) above the mean.

  • recall the datasauRus package, which contains datasets with similar means, standard deviations, and number of observations
    • but very different distributions
  • Figure 11 shows the distribution of 5 of these datasets (top row), and the mean, sd, and number of observations for the variables x and y
    • you’ll see that the distributions look very different
  • for this reason, it’s a good idea to always visualise your raw datapoints regardless of what summary plot you produce (e.g., errorbar plots also hide a lot of data)
Figure 11: Datasets with the same means, sds, and Ns, but very different distributions
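The near-identical summary statistics can be checked directly. This is a sketch assuming the datasauRus package is installed and that its datasaurus_dozen data frame has the columns dataset, x, and y; the name sum_datasaurus is made up for illustration.

```r
library(tidyverse)
library(datasauRus)

# mean, sd, and number of observations per dataset
sum_datasaurus <- datasaurus_dozen %>%
  summarise(mean_x = mean(x),
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y),
            N = n(),
            .by = dataset)

sum_datasaurus
```

A barplot of mean_x or mean_y would look practically identical for every dataset, although the raw data look nothing alike.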

Customising

  • what customisations do you see in the code and plot?
sum_penguins %>% 
  ggplot(aes(x = species, y = mean, 
             colour = sex, shape = sex)) +
  geom_point(position = position_dodge(0.3),
             size = 3) +
  geom_errorbar(aes(ymin=lower,ymax=upper),
                position = position_dodge(0.3), 
                width = .3) +
  scale_colour_colorblind() +
  theme_minimal()

  • position = position_dodge(0.3) tells ggplot2 how to position objects
    • position_dodge() means: move overlapping objects horizontally
    • importantly, you need to use position_dodge() for every geom_ that is supposed to be at the same location, and with the same value; otherwise they won’t be aligned
  • geom_point(size = 3): adjust the size of the points
  • geom_errorbar(width = .3): adjust the width of the errorbars
    • tip: I always give the same value to position_dodge() and geom_errorbar(width = ), this way the errorbars always touch the ‘middle’ line (try changing either value to see what I mean)
  • scale_colour_colorblind(): use a colorblind-friendly colour scheme
  • theme_minimal(): cleans up the plot (we’ve also seen theme_bw(), more about themes here)

Multi-part plots

  • we can combine various types of plots to summarise our data but also provide the distribution
    • this is easiest when they use the same underlying data, like violin plots and boxplots
df_penguins %>% 
  ggplot(aes(x = species, y = body_mass_g, 
             colour = sex, shape = sex)) +
  geom_violin(aes(fill = sex), alpha = .1, position = position_dodge(.9)) +
  geom_boxplot(width = .2, position = position_dodge(.9)) +
  scale_colour_colorblind() +
  scale_fill_colorblind() +
  theme_minimal()

Figure 12: A violin-boxplot

Plotting different data

  • this is trickier when we want to plot summaries (like error bar plots) and distributions
    • errorbar plots take data summaries (mean, sd)
    • violin, boxplot, and scatterplots all take the raw data (each row = observation)
  • let’s try to add a scatterplot to our errorbar plot
    • this could be done several ways, e.g.,
      • taking a scatter plot and adding the mean and errorbar geoms
      • or taking our errorbar plot and adding a scatterplot geom
  • the latter is a bit simpler, so let’s try that
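For completeness, the first option (starting from the scatterplot of the raw data and adding the summary geoms) could look roughly like this. It is a sketch, not the approach used on the following slides: p_raw_first is a hypothetical name, and only x and colour are mapped globally because those are the only columns that exist in both data frames.

```r
library(tidyverse)
library(palmerpenguins)

df_penguins <- penguins %>%
  drop_na()

sum_penguins <- df_penguins %>%
  summarise(mean = mean(body_mass_g),
            sd = sd(body_mass_g),
            upper = mean + sd,
            lower = mean - sd,
            .by = c(species, sex))

p_raw_first <- df_penguins %>%
  ggplot(aes(x = species, colour = sex)) +
  # layer 1: raw data, one point per penguin
  geom_point(aes(y = body_mass_g),
             position = position_dodge(0.3),
             alpha = .4) +
  # layers 2 and 3: summaries, one point and one errorbar per group
  geom_point(data = sum_penguins,
             aes(y = mean),
             position = position_dodge(0.3),
             size = 3) +
  geom_errorbar(data = sum_penguins,
                aes(ymin = lower, ymax = upper),
                position = position_dodge(0.3),
                width = .3)

p_raw_first
```

Here two layers need their own data = argument instead of one, which is why starting from the summary is a bit simpler.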

Add scatterplot to errorbar

  • use geom_point() with the data and aes() needed
sum_penguins %>% 
  ggplot(aes(x = species, y = mean, 
             colour = sex, shape = sex)) +
  geom_point(data = df_penguins, 
             aes(x = species, y = body_mass_g)) +
  geom_point(position = position_dodge(0.3),
             size = 3) +
  geom_errorbar(aes(ymin=lower,ymax=upper),
                position = position_dodge(0.3), 
                width = .3) +
  scale_colour_colorblind() +
  theme_minimal()
Figure 13: Scatterplot with errorbar

Customise scatterplot

  • with position_dodge()
sum_penguins %>% 
  ggplot(aes(x = species, y = mean, 
             colour = sex, shape = sex)) +
  geom_point(data = df_penguins, 
             aes(x = species, y = body_mass_g),
             position = position_dodge(0.3)) +
  geom_point(position = position_dodge(0.3),
             size = 3) +
  geom_errorbar(aes(ymin=lower,ymax=upper),
                position = position_dodge(0.3), 
                width = .3) +
  scale_colour_colorblind() +
  theme_minimal()
Figure 14: Scatterplot with errorbar

Add alpha value

  • so we can distinguish overlapping values
sum_penguins %>% 
  ggplot(aes(x = species, y = mean, 
             colour = sex, shape = sex)) +
  geom_point(data = df_penguins, 
             aes(x = species, y = body_mass_g),
             position = position_dodge(0.3),
             alpha = .4) +
  geom_point(position = position_dodge(0.3),
             size = 3) +
  geom_errorbar(aes(ymin=lower,ymax=upper),
                position = position_dodge(0.3), 
                width = .3) +
  scale_colour_colorblind() +
  theme_minimal()
Figure 15: Scatterplot with errorbar

Change position

  • position_jitterdodge() dodges the points by group and adds random jitter so that they overlap less
    • we can set dodge.width = .3 to match position_dodge() of errorbars
    • and jitter.width = to say how much we want the points to jitter
    • and geom_errorbar(linewidth = 1) makes the errorbar lines thicker (since ggplot2 3.4.0, linewidth replaces size for line thickness)
sum_penguins %>% 
  ggplot(aes(x = species, y = mean, 
             colour = sex, shape = sex)) +
  geom_point(data = df_penguins, 
             aes(y = body_mass_g),
             position = position_jitterdodge(dodge.width = .3, 
                                             jitter.width = 0.3),
             alpha = .4) +
  geom_point(position = position_dodge(width = 0.3),
             size = 3) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
                position = position_dodge(0.3), 
                width = .3,
                linewidth = 1) +
  scale_colour_colorblind() +
  theme_minimal()
Figure 16: Scatterplot with errorbar
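Because jitter is random, the raw points land in slightly different places every time the plot is rendered. A minimal sketch of one way to make this reproducible, using the seed argument of position_jitterdodge(); the value 42 and the name p_jitter are arbitrary choices for illustration.

```r
library(tidyverse)
library(palmerpenguins)

df_penguins <- penguins %>%
  drop_na()

sum_penguins <- df_penguins %>%
  summarise(mean = mean(body_mass_g),
            .by = c(species, sex))

p_jitter <- sum_penguins %>%
  ggplot(aes(x = species, y = mean,
             colour = sex, shape = sex)) +
  geom_point(data = df_penguins,
             aes(y = body_mass_g),
             # a fixed seed gives the same jitter on every render
             position = position_jitterdodge(dodge.width = .3,
                                             jitter.width = 0.3,
                                             seed = 42),
             alpha = .4) +
  geom_point(position = position_dodge(width = 0.3),
             size = 3)

p_jitter
```

This only affects where the raw points are drawn; the means are untouched.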

Today's goals 🏁

Today we learned how to…

  • use facet_wrap() to plot more than three variables ✅
  • visualise summary statistics ✅
  • create multi-part plots ✅

Exercises

Boxplot with facet

  1. Produce a boxplot of the df_penguins data, with:
    • sex plotted on the x axis and with colour or fill (choose one)
    • flipper_length_mm plotted along the y axis
    • island plotted in three panels using facet_wrap()
    • whichever theme_ setting you choose (e.g., theme_bw(); for more options see here)

Code chunk options

  1. Add a label to the figure (fig-...) and a caption (fig-cap:). Briefly describe the plot, using a cross-reference (@fig-... shows that…).

Multi-layered plot

  1. Try to reproduce Figure 17. Hint: You will need to add one geom_ and some labels to Figure 12.

Figure 17: A multi-layered plot

Patchwork

  1. Using the patchwork package (see week 3 notes), plot your boxplot and your errorbar/violin plots side by side. It should look something like Figure 18.
    • hint: if you want to add the “tag levels” (“A” and “B”), you need to add + plot_annotation(tag_levels = "A") from patchwork

Figure 18: Combined plots with patchwork
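As a reminder of the patchwork syntax, here is a minimal sketch that uses two deliberately different plots (a histogram and a scatterplot), so it is not a solution to the exercise. The object names p_hist, p_scatter, and p_combined are made up for illustration.

```r
library(tidyverse)
library(palmerpenguins)
library(patchwork)

df_penguins <- penguins %>%
  drop_na()

p_hist <- df_penguins %>%
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(bins = 30) +
  theme_bw()

p_scatter <- df_penguins %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  theme_bw()

# + places the plots side by side; tag_levels = "A" labels them A, B, ...
p_combined <- p_hist + p_scatter +
  plot_annotation(tag_levels = "A")

p_combined
```

Using / instead of + between the two plots would stack them vertically.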

Session Info

Produced with R version 4.3.0 (2023-04-21) (Already Tomorrow) and RStudio version 2023.3.0.386 (Cherry Blossom).

print(sessionInfo(),locale = F)
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magick_2.7.4         patchwork_1.1.2      ggthemes_4.2.4      
 [4] palmerpenguins_0.1.1 here_1.0.1           lubridate_1.9.2     
 [7] forcats_1.0.0        stringr_1.5.0        dplyr_1.1.2         
[10] purrr_1.0.1          readr_2.1.4          tidyr_1.3.0         
[13] tibble_3.2.1         ggplot2_3.4.2        tidyverse_2.0.0     

loaded via a namespace (and not attached):
 [1] utf8_1.2.3            generics_0.1.3        xml2_1.3.4           
 [4] lattice_0.21-8        stringi_1.7.12        hms_1.1.3            
 [7] digest_0.6.31         magrittr_2.0.3        evaluate_0.21        
[10] grid_4.3.0            timechange_0.2.0      fastmap_1.1.1        
[13] Matrix_1.5-4          rprojroot_2.0.3       jsonlite_1.8.5       
[16] httr_1.4.6            rvest_1.0.3           mgcv_1.8-42          
[19] fansi_1.0.4           viridisLite_0.4.2     scales_1.2.1         
[22] cli_3.6.1             rlang_1.1.1           splines_4.3.0        
[25] munsell_0.5.0         withr_2.5.0           yaml_2.3.7           
[28] tools_4.3.0           tzdb_0.4.0            colorspace_2.1-0     
[31] webshot_0.5.4         pacman_0.5.1          kableExtra_1.3.4.9000
[34] png_0.1-8             vctrs_0.6.2           R6_2.5.1             
[37] lifecycle_1.0.3       pkgconfig_2.0.3       pillar_1.9.0         
[40] gtable_0.3.3          glue_1.6.2            Rcpp_1.0.10          
[43] systemfonts_1.0.4     highr_0.10            xfun_0.39            
[46] tidyselect_1.2.0      rstudioapi_0.14       knitr_1.43           
[49] farver_2.1.1          datasauRus_0.1.6      nlme_3.1-162         
[52] htmltools_0.5.5       svglite_2.1.1         rmarkdown_2.22       
[55] labeling_0.4.2        compiler_4.3.0       

References

Nordmann, E., McAleer, P., Toivo, W., Paterson, H., & DeBruine, L. M. (2022). Data Visualization Using R for Researchers Who Do Not Use R. Advances in Methods and Practices in Psychological Science, 5(2), 251524592210746. https://doi.org/10.1177/25152459221074654
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (n.d.). R for Data Science (2nd ed.). https://r4ds.hadley.nz/